Evaluating Content Extraction on HTML Documents
Abstract
A variety of applications use methods to identify and extract the main textual content of an HTML document, yet the performance of these methods is rarely evaluated. This paper fills this gap by introducing a platform-independent and extensible framework for measuring, evaluating and comparing the performance of Content Extraction methods. We further give an overview of extraction algorithms found in domain-specific applications and present an adaptation of a related algorithm to perform Content Extraction. We compare the algorithms using the developed framework and show that our adapted algorithm performs best on most HTML documents.
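The abstract does not say which measures the framework computes. One common way to score an extractor is to compare its output with a hand-made gold standard using word-level precision, recall and F1. The following is a minimal sketch under that assumption; the function names `tokenize` and `precision_recall_f1` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def precision_recall_f1(extracted, gold):
    """Score extracted text against gold-standard text using bag-of-words overlap.

    Assumption: word-level overlap is the measure of interest; the paper's
    framework may use a different measure.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(gold))
    # Multiset intersection: each word counts as often as it appears in both texts.
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # The extractor kept one navigation word ("menu") and missed two content words.
    p, r, f = precision_recall_f1("main article text menu",
                                  "main article text here now")
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision drops when the extractor keeps clutter such as navigation text, and recall drops when it discards part of the main content, so reporting both shows which kind of error an algorithm makes.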
Similar resources
Efficient Text Content Extraction and Browsing of WWW Documents Using the Abstract Text Viewer
The Abstract Text Viewer (ATV) is an integrated suite of text reading tools for electronic documents, designed to increase the efficiency and effectiveness of content extraction. ATV reads an HTML-formatted document to create more abstract representations, such as a heading structure for overviews. The system uses both well-known techniques for text representation and novel display and content extrac...
Information Extraction from HTML Documents Based on Logical Document Structure
The World Wide Web presents the largest Internet source of information from a broad range of areas. The web documents are mostly written in the Hypertext Markup Language (HTML) that doesn’t contain any means for semantic description of the content and thus the contained information cannot be processed directly. Current approaches for the information extraction from HTML are mostly based on wrap...
Robust Web Data Extraction with XML Path Expressions
Automated extraction of structured Web data has attracted considerable interest in both academia and industry. A particularly promising approach is to employ XML technologies to translate semi-structured HTML documents to “pure” XML documents. In this approach, HTML documents are first normalized into XHTML and then mapped to the desired XML application format by using XML path expressions ...
Optimized Content Extraction from web pages using Composite Approaches
The amount of information available on the web today is tremendous and brings greater challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problem in extracting the content from a web page is the newer architecture of web pages and the diversity in their structure. Optimized content extraction from HTML documents using collective approac...
Kitten: a tool for normalizing HTML and extracting its textual content
The web is composed of a gigantic amount of documents that can be very useful for information extraction systems. Most of them are written in HTML and have to be rendered by an HTML engine in order to display the data they contain on a screen. HTML files thus mix both informational and rendering content. Our goal is to design a tool for informational content extraction. A linear extraction with...
Publication date: 2007